The TSstudio package provides a set of tools for time series analysis and forecasting. The following document demonstrates how to use the package to forecast the monthly consumption of natural gas in the US over the next 5 years (or 60 months).
Install from CRAN:
install.packages("TSstudio")
Or from Github:
devtools::install_github("RamiKrispin/TSstudio")
As of December 2018, the most recent version is 0.1.3:
library(TSstudio)
packageVersion("TSstudio")
## [1] '0.1.3'
The USgas dataset, one of the package datasets, represents the monthly consumption (in Billion Cubic Feet) of natural gas in the US since January 2000 [1]:
# Load the series
data("USgas")
# Get the series info
ts_info(USgas)
## The USgas series is a ts object with 1 variable and 225 observations
## Frequency: 12
## Start time: 2000 1
## End time: 2018 9
# Plot the series
ts_plot(USgas,
title = "US Monthly Natural Gas Consumption",
Ytitle = "Billion Cubic Feet",
Xtitle = "Source: Federal Reserve Bank of St. Louis",
slider = TRUE)
You can note from the plot above that the series has strong seasonality and a fairly stable trend (or growth). We will use the ts_seasonal and ts_heatmap functions to further explore the seasonality and trend of the series:
ts_seasonal(USgas, type = "all")
The ts_seasonal function provides three different views of the seasonality of the series when the type argument is set to "all": the "normal" view (the series split by year), the "cycle" view (each frequency unit plotted over time), and the "box" view (a boxplot of each frequency unit).
The combination of the three plots provides the full picture of the series seasonality. Alternatively, you can use the ts_heatmap function to get a heatmap representation of the series:
ts_heatmap(USgas)
The next step is to identify the level of correlation between the series and its lags using the ts_acf function, an interactive version of the acf function:
ts_acf(USgas, lag.max = 36)
As expected, you can see that the series is highly correlated with its seasonal lags. A more intuitive way to identify a dependency between the series and its lags is the ts_lags function, which plots the series against its lags:
ts_lags(USgas)
As observed before with the ts_acf function, you can notice that the seasonal lag (lag 12) of the series has a strong linear relationship with the series. In a similar way, we can zoom in on the seasonal lags of the series using the lags argument:
ts_lags(USgas, lags = c(12, 24, 36, 48))
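The strong relationship with lag 12 can also be confirmed numerically. A minimal sketch using only base R (no TSstudio functions required), computing the Pearson correlation between the series and its 12-month lag:

```r
# Correlation between USgas and its seasonal (12-month) lag,
# computed directly for comparison with the ACF plot above
x <- as.numeric(USgas)
n <- length(x)
cor(x[13:n], x[1:(n - 12)])
```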
Using the information we learned from the seasonal and correlation analysis, we will conduct a "horse race" between several models and select the one that performs best on the testing set, using two training approaches:
Traditional approach - splitting the series into training and testing (sample out) partitions, training each model on the training set, and evaluating its performance on the testing set. We will use the following four models: auto.arima, ets, nnetar, and tslm from the forecast package.
Backtesting - using an expanding window to train and test each model on multiple training and testing sets. We will utilize the ts_backtesting function to train multiple models from the forecast, forecastHybrid, and bsts packages.
# Set the sample out and forecast horizon
h1 <- 12 # sample out length
h2 <- 60 # forecast horizon
USgas_split <- ts_split(USgas, sample.out = h1)
train <- USgas_split$train
test <- USgas_split$test
ts_info(train)
## The train series is a ts object with 1 variable and 213 observations
## Frequency: 12
## Start time: 2000 1
## End time: 2017 9
ts_info(test)
## The test series is a ts object with 1 variable and 12 observations
## Frequency: 12
## Start time: 2017 10
## End time: 2018 9
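As a quick sanity check, the two partitions should add back up to the full series, with the test set starting right after the training set ends:

```r
# The split should be exhaustive and non-overlapping
length(train) + length(test) == length(USgas) # TRUE
tail(time(train), 1) < head(time(test), 1)    # TRUE
```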
set.seed(1234)
library(forecast)
# auto.arima
md1 <- auto.arima(train,
stepwise = FALSE,
approximation = FALSE,
D = 1)
fc1 <- forecast(md1, h = h1)
accuracy(fc1, test)
## ME RMSE MAE MPE MAPE MASE
## Training set -1.66563 98.53365 72.04795 -0.2777977 3.497193 0.6743267
## Test set 174.92468 196.47007 174.92468 7.0405671 7.040567 1.6371928
## ACF1 Theil's U
## Training set 0.02073135 NA
## Test set -0.15231904 0.5813516
# ETS
md2 <- ets(train, opt.crit = "mse")
fc2 <- forecast(md2, h = h1)
accuracy(fc2, test)
## ME RMSE MAE MPE MAPE MASE
## Training set 4.447545 99.80075 73.6057 0.04715887 3.595929 0.6889063
## Test set 111.474912 150.42645 140.6081 4.76949862 5.848864 1.3160094
## ACF1 Theil's U
## Training set 0.07295528 NA
## Test set 0.13042602 0.4570809
# Neural network time series forecasts
md3 <- nnetar(train,
P = 2, # using 2 seasonal lags
p = 1, # and 1 non-seasonal lag
repeats = 100)
fc3 <- forecast(md3, h = h1)
accuracy(fc3, test)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.02583563 120.4015 94.33888 -0.3202311 4.588517 0.8829568
## Test set 181.70905836 214.1088 181.70906 6.9434937 6.943494 1.7006906
## ACF1 Theil's U
## Training set 0.3922585 NA
## Test set 0.4134646 0.5800586
# Time series linear regression
md4 <- tslm(train ~ season + trend)
fc4 <- forecast(md4, h = h1)
accuracy(fc4, test)
## ME RMSE MAE MPE
## Training set -1.043571e-15 117.2227 89.62472 -0.3040303
## Test set 1.853531e+02 209.1038 190.04833 7.4706371
## MAPE MASE ACF1 Theil's U
## Training set 4.479828 0.838835 0.5742214 NA
## Test set 7.644592 1.778741 -0.1787256 0.6243439
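To compare the four models side by side, the test-set error metrics from the accuracy tables above can be collected into a single table (a convenience sketch, not part of the TSstudio workflow):

```r
# Gather the test-set RMSE and MAPE of each model and sort by RMSE
results <- rbind(auto.arima = accuracy(fc1, test)["Test set", c("RMSE", "MAPE")],
                 ets        = accuracy(fc2, test)["Test set", c("RMSE", "MAPE")],
                 nnetar     = accuracy(fc3, test)["Test set", c("RMSE", "MAPE")],
                 tslm       = accuracy(fc4, test)["Test set", c("RMSE", "MAPE")])
results[order(results[, "RMSE"]), ]
```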
We will utilize the test_forecast function to visualize the goodness of fit of each model on both the training and testing partitions:
test_forecast(forecast.obj = fc1, actual = USgas, test = test)
test_forecast(forecast.obj = fc2, actual = USgas, test = test)
test_forecast(forecast.obj = fc3, actual = USgas, test = test)
test_forecast(forecast.obj = fc4, actual = USgas, test = test)
Looking at the results above, we can see that the ets model achieved the lowest error rate on the testing set (both RMSE and MAPE), and therefore we will use this model to forecast the monthly consumption over the next 5 years. We will first retrain the model on the full series and review the residuals:
md2a <- ets(USgas, opt.crit = "mse")
fc2a <- forecast(md2a, h = h2)
plot_forecast(fc2a)
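One way to review the residuals mentioned above is the checkresiduals function from the forecast package (already loaded), which plots the residuals, their ACF, and a histogram, and runs a Ljung-Box test:

```r
# Residual diagnostics for the final ETS model:
# time plot, ACF, histogram, and Ljung-Box test
checkresiduals(md2a)
```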
We will use the ts_backtesting function to train and test multiple models (auto.arima, ets, HoltWinters, nnetar, and tbats from the forecast package, hybridModel from the forecastHybrid package, and bsts from the bsts package) over six periods of time (periods = 6). As in the traditional approach above, we will set the testing partition to 12 months (h = h1) and the forecast horizon to 60 months (h = h2).
md5 <- ts_backtesting(ts.obj = USgas,
periods = 6,
error = "RMSE",
window_size = h1,
h = h2,
a.arg = list(stepwise = FALSE,
approximation = FALSE,
D = 1),
e.arg = list(opt.crit = "mse"),
n.arg = list(P = 2,
p = 1,
repeats = 100),
h.arg = list(errorMethod = "RMSE",
verbose = FALSE))
## Model_Name avgMAPE sdMAPE avgRMSE sdRMSE
## 1 auto.arima 6.305000 0.5681813 187.9250 7.128455
## 2 hybrid 6.701667 0.7568994 195.8167 19.133263
## 3 ets 7.778333 1.7040824 202.5533 45.956891
## 4 tbats 7.468333 0.7641313 208.5633 17.602901
## 5 bsts 7.271667 2.1809119 210.0217 50.911591
## 6 nnetar 6.753333 0.1978552 213.0033 2.394850
## 7 HoltWinters 7.736667 0.6686903 218.5017 19.057048
md5$summary_plot
The main advantage of the backtesting approach over the traditional approach is that it provides an overview of each model's performance over time. This allows you to identify, in addition to accuracy, the stability of each model's performance over time. Looking at the summary plot above, you can notice that:
- The nnetar model is the most stable, as the range of its RMSE is fairly small.
- The auto.arima model achieved the lowest RMSE and MAPE on the testing partitions (since we defined the model selection criterion as RMSE, the function selected this forecasting approach).
- The bsts and ets models are worth watching, as their error is consistently dropping over time.
[1] U.S. Bureau of Transportation Statistics, Natural Gas Consumption [NATURALGAS], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/NATURALGAS, January 7, 2018.